Classification using speed dating dataset

The “Business Decision”

We would like to understand who is most likely to be selected in speed dating, and which key drivers affect people's decisions when choosing a partner.

The Data

We used data from a speed-dating experiment conducted at Columbia Business School, available on Kaggle. This is how the first 5 of the total of 4299 rows look:

Obs 1 Obs 2 Obs 3 Obs 4 Obs 5
attr_o 6 6 7 8 6
sinc_o 7 5 7 8 6
intel_o 8 10 7 9 7
fun_o 7 6 9 8 7
amb_o 7 6 9 8 8
shar_o 5 5 9 9 7
field_cd 1 1 1 1 1
race 2 2 2 2 2
goal 1 1 1 1 1
date 5 5 5 5 5
go_out 1 1 1 1 1
career_c 1 1 1 1 1
sports 1 1 1 1 1
tvsports 1 1 1 1 1
exercise 6 6 6 6 6
dining 7 7 7 7 7
museums 6 6 6 6 6
art 7 7 7 7 7
hiking 7 7 7 7 7
gaming 5 5 5 5 5
clubbing 7 7 7 7 7
reading 7 7 7 7 7
tv 7 7 7 7 7
theater 9 9 9 9 9
movies 7 7 7 7 7
concerts 8 8 8 8 8
music 7 7 7 7 7
shopping 1 1 1 1 1
yoga 8 8 8 8 8

A Process for Classification

Classification in 6 steps

We followed this approach to classification (as explained in class):

  1. Create an estimation sample and two validation samples by splitting the data into three groups. Steps 2-5 below will then be performed only on the estimation and the first validation data. You should only do step 6 once on the second validation data, also called test data, and report/use the performance on that (second validation) data only to make final business decisions.
  2. Set up the dependent variable (as a categorical 0-1 variable; multi-class classification is also feasible, and similar, but we do not explore it in this note).
  3. Make a preliminary assessment of the relative importance of the explanatory variables using visualization tools and simple descriptive statistics.
  4. Estimate the classification model using the estimation data, and interpret the results.
  5. Assess the accuracy of classification in the first validation sample, possibly repeating steps 2-5 a few times in different ways to increase performance.
  6. Finally, assess the accuracy of classification in the second validation sample. You should eventually use/report all relevant performance measures/plots on this second validation sample only.

Let’s follow these steps.

Step 1: Split the data

We split the data into three samples: estimation_data (80% of the data in our case), validation_data (10% of the data), and test_data (the remaining 10% of the data).

In our case we use 3439 observations in the estimation data, 430 in the validation data, and 430 in the test data.
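The split can be sketched as follows. The report's analysis was done in R; this is a Python sketch with a hypothetical random seed, using the 80/10/10 proportions above:

```python
import numpy as np

def split_data(n_rows, seed=42):
    """Shuffle row indices and split them 80/10/10 into estimation,
    validation, and test sets (the proportions used above)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    n_est = round(0.8 * n_rows)
    n_val = round(0.1 * n_rows)
    return idx[:n_est], idx[n_est:n_est + n_val], idx[n_est + n_val:]

est, val, test = split_data(4299)
print(len(est), len(val), len(test))  # 3439 430 430
```

Shuffling before splitting matters: it removes any ordering in the raw data (e.g. by session) that would otherwise bias the three samples.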

Step 2: Choose dependent variable

Our dependent variable is dec_o. It states whether the given subject was selected by the partner. In our data the number of 0/1's in our estimation sample is as follows.

Class 1 Class 0
# of Observations 1490 1949

while in the validation sample they are:

Class 1 Class 0
# of Observations 203 227

Step 3: Simple Analysis

Below are summary statistics of our independent variables for each of the two classes. First, class 1, "selected":

min 25th pct median mean 75th pct max std
attr_o 1 6 7 7.27 8 10 1.53
sinc_o 0 7 8 7.65 9 10 1.46
intel_o 3 7 8 7.73 9 10 1.29
fun_o 0 6 7 7.26 8 10 1.54
amb_o 2 6 7 7.11 8 10 1.56
shar_o 0 5 7 6.48 8 10 1.81
field_cd 1 3 8 7.21 10 17 3.90
race 1 2 2 2.69 4 6 1.23
goal 1 1 2 2.08 2 6 1.38
date 1 4 5 4.89 6 7 1.46
go_out 1 1 2 2.02 3 7 1.00
career_c 1 2 5 5.03 7 17 3.22
sports 1 5 7 6.50 9 10 2.62
tvsports 1 2 4 4.47 7 10 2.76
exercise 1 5 7 6.31 8 10 2.50
dining 3 7 8 7.77 9 10 1.75
museums 1 6 7 6.95 8 10 2.02
art 1 5 7 6.70 8 10 2.28
hiking 0 4 6 5.81 8 10 2.53
gaming 1 1 4 3.73 5 10 2.36
clubbing 1 4 6 5.78 8 10 2.51
reading 1 7 8 7.68 9 10 1.89
tv 1 3 5 5.09 7 10 2.52
theater 1 5 7 6.80 9 10 2.27
movies 2 7 8 7.92 9 10 1.69
concerts 1 6 7 6.92 9 10 2.12
music 1 7 8 7.88 9 10 1.76
shopping 1 4 6 5.67 8 10 2.61
yoga 1 2 4 4.39 7 10 2.70

and class 0, “not selected”:

min 25th pct median mean 75th pct max std
attr_o 0 4 6 5.37 7 10 1.77
sinc_o 0 6 7 6.85 8 10 1.86
intel_o 0 6 7 7.06 8 10 1.60
fun_o 0 5 6 5.65 7 10 1.96
amb_o 0 5 7 6.50 8 10 1.83
shar_o 0 3 5 4.74 6 10 2.05
field_cd 1 5 8 7.24 10 17 3.66
race 1 2 2 2.75 4 6 1.20
goal 1 1 2 2.26 3 6 1.51
date 1 4 5 5.06 6 7 1.40
go_out 1 1 2 2.23 3 7 1.18
career_c 1 2 5 4.94 7 17 3.20
sports 1 4 7 6.29 9 10 2.69
tvsports 1 2 4 4.55 7 10 2.87
exercise 1 4 6 6.00 8 10 2.53
dining 1 7 8 7.66 9 10 1.85
museums 1 5 7 6.90 8 10 2.10
art 1 5 7 6.62 8 10 2.25
hiking 0 3 6 5.69 8 10 2.68
gaming 1 1 4 3.86 6 10 2.47
clubbing 1 4 6 5.46 7 10 2.37
reading 1 7 8 7.71 9 10 1.98
tv 1 3 6 5.27 7 10 2.43
theater 1 5 7 6.90 9 10 2.09
movies 2 7 8 8.00 9 10 1.56
concerts 1 6 7 6.82 8 10 2.15
music 1 7 8 7.74 9 10 1.78
shopping 1 3 5 5.42 8 10 2.64
yoga 1 2 4 4.33 7 10 2.73
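Tables like the two above can be produced with a group-by on the dependent variable. A minimal Python/pandas sketch (the column names come from the report; the values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for the speed-dating data; column names are from the report,
# the values are made up for illustration.
df = pd.DataFrame({
    "dec_o":  [1, 1, 0, 0, 1, 0],
    "attr_o": [7, 8, 5, 4, 9, 6],
    "fun_o":  [7, 8, 5, 6, 9, 4],
})

# Summary statistics of each independent variable within each class.
stats = df.groupby("dec_o").agg(["min", "median", "mean", "max", "std"])
print(stats.loc[1, ("attr_o", "mean")])  # 8.0
```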

A simple visualization of these values is presented below using box plots, which visually indicate simple summary statistics of an independent variable (e.g. median, top and bottom quartiles, min, max). For example, for class 0


and class 1:


Step 4: Classification and Interpretation

For our assignment, we used three classification methods: logistic regression, classification and regression trees (CART), and random forests.

Running a basic CART model with complexity control cp=0.01 leads to the following tree:


The key decision criteria can be read from the following table, which maps the variables used in the tree:

Attribute Name
IV1 attr_o
IV4 fun_o
IV6 shar_o
IV5 amb_o
IV2 sinc_o
IV3 intel_o
IV27 music

One can estimate larger trees by changing the tree's complexity control parameter (in this case the rpart.control argument cp). For example, this is how the tree looks if we set cp = 0.005:

Attribute Name
IV1 attr_o
IV4 fun_o
IV6 shar_o
IV5 amb_o
IV2 sinc_o
IV3 intel_o
IV21 clubbing
IV16 dining
IV27 music
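At each node, CART chooses the split that most reduces class impurity. A dependency-free Python sketch of that choice for a single variable (the attr_o ratings and decisions below are made up for illustration):

```python
import numpy as np

def best_split(x, y):
    """Find the single split 'x > t' that minimizes weighted Gini impurity,
    i.e. what CART does at each node (shown here for one variable)."""
    def gini(labels):
        if len(labels) == 0:
            return 0.0
        p = np.mean(labels)          # fraction of 1's in this branch
        return 2 * p * (1 - p)       # Gini impurity for a 0/1 outcome
    best_t, best_score = None, float("inf")
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# Hypothetical attr_o ratings and decisions: high ratings tend to be selected.
attr_o = np.array([3, 4, 5, 6, 7, 8, 9, 10])
dec_o  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(best_split(attr_o, dec_o))  # 6 -> the rule "attr_o > 6" splits perfectly
```

Growing a full tree repeats this search recursively over all variables, stopping when the improvement falls below the complexity parameter (cp in rpart).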

Using the first CART above, the estimated probability of belonging to class 1 for the first few validation observations is:

Actual Class Probability of Class 1
Obs 1 1 0.61
Obs 2 1 0.22
Obs 3 0 0.22
Obs 4 0 0.22
Obs 5 1 0.22

Logistic regression is a method similar to linear regression, except that the dependent variable is discrete (e.g. 0 or 1). Linear logistic regression estimates the coefficients of a linear model over the selected independent variables while optimizing a classification criterion. For example, these are the estimated logistic regression parameters for our data:

Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.1 0.5 -9.8 0.0
attr_o 0.6 0.0 17.5 0.0
sinc_o -0.1 0.0 -1.7 0.1
intel_o 0.0 0.0 0.2 0.8
fun_o 0.2 0.0 7.1 0.0
amb_o -0.2 0.0 -4.7 0.0
shar_o 0.3 0.0 10.1 0.0
field_cd 0.0 0.0 -1.4 0.2
race 0.1 0.0 1.7 0.1
goal 0.0 0.0 -1.4 0.2
date 0.0 0.0 0.5 0.6
go_out -0.1 0.0 -1.3 0.2
career_c 0.0 0.0 -0.2 0.9
sports 0.0 0.0 0.8 0.4
tvsports 0.0 0.0 -1.7 0.1
exercise 0.0 0.0 1.0 0.3
dining 0.0 0.0 -0.9 0.4
museums 0.0 0.0 -0.4 0.7
art 0.0 0.0 0.8 0.4
hiking 0.0 0.0 1.1 0.3
gaming 0.0 0.0 -1.5 0.1
clubbing 0.0 0.0 -1.0 0.3
reading 0.0 0.0 0.0 1.0
tv 0.0 0.0 -0.5 0.6
theater 0.0 0.0 -1.4 0.2
movies 0.0 0.0 -0.2 0.9
concerts 0.0 0.0 0.2 0.8
music 0.0 0.0 -0.2 0.9
shopping 0.0 0.0 2.2 0.0
yoga 0.0 0.0 -1.9 0.1

Given a set of independent variables, the output of the estimated logistic regression (the sum of the products of the independent variables with the corresponding regression coefficients) can be used to assess the probability that an observation belongs to one of the classes. Specifically, the regression output can be transformed into a probability of belonging to, say, class 1 for each observation. In our case, the probability that a validation observation belongs to class 1 (i.e. that the subject is selected by the partner), for the first few validation observations using the logistic regression above, is:

Actual Class Probability of Class 1
Obs 1 1 0.56
Obs 2 1 0.18
Obs 3 0 0.67
Obs 4 0 0.68
Obs 5 1 0.10

The default decision is to classify each observation in the group with the highest probability - but one can change this choice, as we discuss below.
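This decision rule is just a threshold on the predicted probability. A small sketch, using the logistic regression probabilities from the table above:

```python
import numpy as np

def predict_class(probs, threshold=0.5):
    """Classify as 1 when the predicted probability of class 1
    exceeds the threshold (the default decision rule described above)."""
    return (np.asarray(probs) > threshold).astype(int)

# Probabilities for the first few validation observations (from the table above).
p = [0.56, 0.18, 0.67, 0.68, 0.10]
print(predict_class(p))                  # [1 0 1 1 0]
print(predict_class(p, threshold=0.7))   # [0 0 0 0 0]
```

Raising the threshold trades false positives for false negatives, which is exactly the knob behind the ROC, lift, and profit curves discussed below.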

Selecting the best subset of independent variables for logistic regression, a special case of the general problem of feature selection, is an iterative process where both the significance of the regression coefficients as well as the performance of the estimated logistic regression model on the first validation data are used as guidance. A number of variations are tested in practice, each leading to different performances, which we discuss next.

In our case, we can see the relative importance of the independent variables using the variable.importance of the CART trees (see help(rpart.object) in R) or the z-scores from the output of logistic regression. For easier visualization, we scale all values between -1 and 1 (the scaling is done for each method separately - note that CART does not provide the sign of the “coefficients”). From this table we can see the key drivers of the classification according to each of the methods we used here.


Random forests, the last of the three methods, ranks variables by their mean decrease in accuracy, shown in the last column of the table below.

CART 1 CART 2 Logistic Regr. Random Forests - mean decrease in accuracy
attr_o 1.00 1.00 1.00 1.00
sinc_o -0.19 -0.18 -0.10 0.21
intel_o 0.19 0.18 0.01 0.15
fun_o 0.45 0.44 0.41 0.57
amb_o -0.20 -0.19 -0.27 0.07
shar_o 0.42 0.41 0.58 0.74
field_cd 0.00 0.00 -0.08 0.19
race 0.00 0.00 0.10 0.09
goal 0.00 0.00 -0.08 0.17
date 0.00 0.00 0.03 0.20
go_out 0.00 0.00 -0.07 0.16
career_c 0.00 0.00 -0.01 0.15
sports 0.00 0.00 0.05 0.19
tvsports 0.00 0.00 -0.10 0.16
exercise 0.00 0.00 0.06 0.18
dining 0.00 0.00 -0.05 0.16
museums 0.00 0.00 -0.02 0.18
art 0.00 0.00 0.05 0.18
hiking 0.00 0.00 0.06 0.21
gaming 0.00 0.00 -0.09 0.16
clubbing 0.00 -0.01 -0.06 0.16
reading 0.00 0.00 0.00 0.14
tv 0.00 0.00 -0.03 0.16
theater 0.00 0.00 -0.08 0.16
movies 0.00 0.00 -0.01 0.16
concerts 0.00 0.00 0.01 0.19
music 0.00 0.00 -0.01 0.18
shopping 0.00 0.00 0.13 0.20
yoga 0.00 0.00 -0.11 0.16

In general we do not see significant differences across the methods, which makes sense: attractiveness, fun, and shared interests dominate under all of them.
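The per-method scaling behind the table can be sketched as dividing each method's scores by its largest absolute score. Applying it to the logistic regression z-scores from the output above reproduces that column:

```python
import numpy as np

def scale_importance(values):
    """Scale one method's importance scores to [-1, 1] by dividing by the
    largest absolute value, as done (per method) in the table above."""
    values = np.asarray(values, dtype=float)
    return values / np.abs(values).max()

# z-scores of the first six variables from the logistic regression output.
z = np.array([17.5, -1.7, 0.2, 7.1, -4.7, 10.1])
scaled = np.round(scale_importance(z), 2)
print(scaled)  # matches the Logistic Regr. column: 1.00 -0.10 0.01 0.41 -0.27 0.58
```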

Step 5: Validation accuracy

1. Hit ratio

Below is the percentage of observations that have been correctly classified (the predicted class matches the actual class) using a 50% probability threshold, for the validation data:

Hit Ratio
First CART 68.13953
Second CART 68.60465
Logistic Regression 69.76744
Random Forests 68.37209

while for the estimation data the hit rates are:

Hit Ratio
First CART 75.02181
Second CART 75.48706
Logistic Regression 75.10904
Random Forests 99.82553

A simple benchmark to compare the performance of a classification model against is the Maximum Chance Criterion. This measures the proportion of the largest class. For our validation data the largest group is subjects who were not selected (class 0): 227 out of 430 people. So without doing any work, classifying every individual into the largest group would already give a hit rate of 52.79%.

In our case, all the methods used exceed this benchmark.

2. Confusion matrix

The confusion matrix shows, for each class, the number (or percentage) of observations that are correctly classified for that class. For example, for the method above with the highest hit rate in the validation data (among the logistic regression, the two CART models, and random forests), the confusion matrix for the validation data is:

Predicted 1 Predicted 0
Actual 1 74.38 25.62
Actual 0 65.64 34.36
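Both measures are straightforward to compute directly. A sketch with toy actual/predicted labels (not the report's data):

```python
import numpy as np

def hit_ratio(actual, predicted):
    """Percentage of observations whose predicted class matches the actual one."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return 100.0 * float(np.mean(actual == predicted))

def confusion_percentages(actual, predicted):
    """Row-normalized confusion matrix: for each actual class, the
    percentage predicted as 1 and as 0 (the layout used above)."""
    out = {}
    for cls in (1, 0):
        preds = np.asarray(predicted)[np.asarray(actual) == cls]
        out[cls] = (100.0 * float(np.mean(preds == 1)),
                    100.0 * float(np.mean(preds == 0)))
    return out

actual    = [1, 1, 0, 0, 1, 0, 1, 0]   # toy labels
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
print(hit_ratio(actual, predicted))  # 75.0
```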

3. ROC curve


Remember that each observation is classified by our model according to the probabilities Pr(0) and Pr(1) and a chosen probability threshold. Typically we set the probability threshold to 0.5 - so that observations for which Pr(1) > 0.5 are classified as 1’s. However, we can vary this threshold, for example if we are interested in correctly predicting all 1’s but do not mind missing some 0’s (and vice-versa) - can you think of such a scenario?

When we change the probability threshold we get different values of the hit rate, the false positive rate, the false negative rate, or any other performance metric. We can plot, for example, how the false positive versus true positive rates change as we alter the probability threshold, generating the so-called ROC curve.

The ROC curves for the validation data for all four classifiers are as follows:

How should a good ROC curve look? A rule of thumb in assessing ROC curves is that the "higher" the curve, hence the larger the area under it, the better. You may also select one point on the ROC curve (the "best" one for our purpose) and use its false positive/false negative performance (and the corresponding threshold for P(0)) to assess your model. Which point on the ROC should we select?
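Tracing the ROC curve by hand means sweeping the threshold and recording (false positive rate, true positive rate) pairs. A sketch using the first CART's probabilities from the earlier validation table:

```python
import numpy as np

def roc_points(actual, probs, thresholds):
    """(false positive rate, true positive rate) at each threshold,
    i.e. the points that trace out the ROC curve."""
    actual, probs = np.asarray(actual), np.asarray(probs)
    points = []
    for t in thresholds:
        pred = probs > t
        tpr = float(np.mean(pred[actual == 1]))  # 1's correctly caught
        fpr = float(np.mean(pred[actual == 0]))  # 0's wrongly flagged
        points.append((fpr, tpr))
    return points

# Actual classes and probabilities from the first CART's validation table above.
actual = [1, 1, 0, 0, 1]
probs = [0.61, 0.22, 0.22, 0.22, 0.22]
pts = roc_points(actual, probs, [0.5, 0.2])
print(pts)
```

At threshold 0.5 only the first observation is flagged (no false positives, one of three 1's caught); at 0.2 everything is flagged (both rates reach 1.0).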

4. Lift curve

By changing the probability threshold, we can also generate the so called lift curve, which is useful for certain applications e.g. in marketing or credit risk. For example, consider the case of capturing fraud by examining only a few transactions instead of every single one of them. In this case we may want to examine as few transactions as possible and capture the maximum number of frauds possible. We can measure the percentage of all frauds we capture if we only examine, say, x% of cases (the top x% in terms of Probability(fraud)). If we plot these points [percentage of class 1 captured vs percentage of all data examined] while we change the threshold, we get a curve that is called the lift curve.

The lift curves for the validation data for our four classifiers are the following:

How should a good lift curve look? Notice that if we were to randomly examine transactions, the "random prediction" lift curve would be a straight 45-degree diagonal line (why?). So the further above this 45-degree line our lift curve is, the better the "lift". Moreover, much like for the ROC curve, one can choose the probability threshold so that any point of the lift curve is selected. Which point on the lift curve should we select in practice?
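The lift curve can be traced by ranking observations by predicted probability and counting the 1's captured as we examine more of the data. A sketch with toy values:

```python
import numpy as np

def lift_points(actual, probs):
    """For each fraction of observations examined (ranked by predicted
    probability of class 1), the fraction of all 1's captured."""
    order = np.argsort(probs)[::-1]                # highest probability first
    captured = np.cumsum(np.asarray(actual)[order])
    total_ones = captured[-1]
    n = len(actual)
    return [(k / n, float(captured[k - 1] / total_ones))
            for k in range(1, n + 1)]

actual = [1, 0, 1, 0]          # toy classes
probs = [0.9, 0.8, 0.7, 0.1]   # toy predicted probabilities
print(lift_points(actual, probs))  # [(0.25, 0.5), (0.5, 0.5), (0.75, 1.0), (1.0, 1.0)]
```

Here, examining the top 25% of cases already captures half of the 1's, which is the "lift" over the diagonal random-prediction line.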

5. Profit Curve

Finally, we can generate the so-called profit curve, which we often use to make our final decisions. The intuition is as follows. Consider a direct marketing campaign, and suppose it costs $1 to send an advertisement, and the expected profit from a person who responds positively is $45. Suppose you have a database of 1 million people to whom you could potentially send the ads. To what fraction of the 1 million people should you send ads (typical response rates are 0.05%)? To answer this type of question we need to create the profit curve, which is generated by again changing the probability threshold for classifying observations: for each threshold value we simply measure the total expected profit (or loss) we would generate. This is equal to:

Total Expected Profit = (% of 1’s correctly predicted)x(value of capturing a 1) + (% of 0’s correctly predicted)x(value of capturing a 0) + (% of 1’s incorrectly predicted as 0)x(cost of missing a 1) + (% of 0’s incorrectly predicted as 1)x(cost of missing a 0)

Calculating the expected profit requires we have an estimate of the 4 costs/values: value of capturing a 1 or a 0, and cost of misclassifying a 1 into a 0 or vice versa.

Given the values and costs of correct classifications and misclassifications, we can plot the total expected profit (or loss) as we change the probability threshold, much like how we generated the ROC and lift curves. Here is the profit curve for our example if we consider the following business profit and loss for the correctly classified as well as the misclassified customers:

Predict 1 Predict 0
Actual 1 100 -75
Actual 0 -50 0

Based on these profit and cost estimates, the profit curves for the validation data for the four classifiers are:

We can then select the threshold that corresponds to the maximum expected profit (or minimum loss, if necessary).

Notice that maximizing expected profit requires the cost/profit for each of the 4 cases! This can be difficult to assess, so typically some sensitivity analysis to our cost/profit assumptions is needed: for example, we can generate different profit curves (worst case, best case, average case scenarios) and see how much the best profit varies, and most importantly how our selection of the classification model and of the probability threshold varies, as these are what we eventually need to decide.
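The expected-profit computation can be sketched directly from the payoff table above (100 / -75 / -50 / 0), here applied to the logistic regression probabilities shown earlier:

```python
import numpy as np

# Profit/cost for each (actual, predicted) cell, from the table above.
PAYOFF = {(1, 1): 100, (1, 0): -75, (0, 1): -50, (0, 0): 0}

def expected_profit(actual, probs, threshold):
    """Total profit at a given probability threshold, summing the payoff
    of every (actual, predicted) pair."""
    pred = (np.asarray(probs) > threshold).astype(int)
    return sum(PAYOFF[(a, p)] for a, p in zip(actual, pred))

# Actual classes and logistic regression probabilities from the validation table.
actual = [1, 1, 0, 0, 1]
probs = [0.56, 0.18, 0.67, 0.68, 0.10]

# Sweep thresholds to trace the profit curve and pick the most profitable one.
grid = np.linspace(0, 1, 21)
best = max(grid, key=lambda t: expected_profit(actual, probs, t))
print(expected_profit(actual, probs, 0.5), float(best))  # -150 0.0
```

On these five observations the 0.5 threshold loses money, while a very low threshold (classify everyone as 1) maximizes profit; with realistic class balances and payoffs the optimum usually lies strictly between 0 and 1.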


Step 6: Test Accuracy

Below are the hit ratios for all four methods on the test data:

Hit Ratio
First CART 76.27907
Second CART 75.81395
Logistic Regression 77.44186
Random Forests 76.27907

The confusion matrix on the test data for the model with the best validation-data hit ratio above:

Predicted 1 Predicted 0
Actual 1 66 34
Actual 0 86 14

ROC curves for the test data:


Lift Curves for the test data:

Finally the profit curves for the test data, using the same profit/cost estimates as we did above:
